39 research outputs found

    Lace: non-blocking split deque for work-stealing

    Work-stealing is an efficient method to implement load balancing in fine-grained task parallelism. Typically, concurrent deques are used for this purpose. A disadvantage of many concurrent deques is that they require expensive memory fences for local deque operations. In this paper, we propose a new non-blocking work-stealing deque based on the split task queue. Our design uses a dynamic split point between the shared and the private portions of the deque, and only requires memory fences when shrinking the shared portion. We present Lace, an implementation of work-stealing based on this deque, with an interface similar to that of the work-stealing library Wool, and an evaluation of Lace on several common benchmarks. We also implement a recent approach using private deques in Lace. We show that the split deque and the private deque in Lace offer the same low overhead and high scalability as Wool.
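    To make the fence placement concrete, here is a minimal C++ sketch of such a split deque, assuming a single owner thread, a fixed capacity, no index wraparound, and conservative seq_cst orderings. The class name and protocol details are ours for illustration, not Lace's API; the real non-blocking protocol is more refined and avoids the spurious-retry cases this toy version permits.

        #include <atomic>
        #include <cstddef>
        #include <optional>

        // Tasks in [head, split) are shared with thieves; [split, tail) are
        // private to the owner. Only the owner ever writes `split` and `tail`.
        struct Task { void (*fn)(void*); void* arg; };

        class SplitDeque {
            static constexpr std::size_t CAP = 1024;  // overflow handling omitted
            Task buf[CAP];
            std::atomic<std::size_t> head{0};    // next slot a thief may claim
            std::atomic<std::size_t> split{0};   // shared/private boundary
            std::size_t tail = 0;                // owner-only, no atomics

        public:
            // Owner: push into the private portion -- no fence, no atomics.
            // Call grow_shared() afterwards to publish work to thieves.
            void push(Task t) { buf[tail++] = t; }

            // Owner: expose privately queued tasks. Growing the shared portion
            // needs only a release store, never a full fence.
            void grow_shared() { split.store(tail, std::memory_order_release); }

            // Owner: pop from the private portion; fence only when shrinking.
            std::optional<Task> pop() {
                std::size_t s = split.load(std::memory_order_relaxed);
                if (tail > s) return buf[--tail];          // private fast path
                if (s == head.load(std::memory_order_relaxed))
                    return std::nullopt;                   // deque empty
                // Private part empty: shrink the shared part to reclaim slot
                // s-1. This is the one owner operation that pays a full fence,
                // because a thief may be claiming that slot concurrently.
                split.store(s - 1, std::memory_order_seq_cst);
                std::atomic_thread_fence(std::memory_order_seq_cst);
                if (head.load(std::memory_order_seq_cst) >= s) {
                    // A thief got there first (possibly a spurious empty;
                    // the owner simply restores the boundary and retries).
                    split.store(s, std::memory_order_relaxed);
                    return std::nullopt;
                }
                return buf[--tail];                        // reclaimed slot s-1
            }

            // Thief: claim the slot at `head` if it lies in the shared portion.
            std::optional<Task> steal() {
                std::size_t h = head.load(std::memory_order_seq_cst);
                if (h >= split.load(std::memory_order_seq_cst))
                    return std::nullopt;                   // nothing to steal
                if (!head.compare_exchange_strong(h, h + 1,
                                                  std::memory_order_seq_cst))
                    return std::nullopt;                   // lost to another thief
                // Re-validate after claiming: the owner may have concurrently
                // shrunk the shared portion and taken this slot. If so, undo.
                if (h >= split.load(std::memory_order_seq_cst)) {
                    head.store(h, std::memory_order_seq_cst);
                    return std::nullopt;
                }
                return buf[h];
            }
        };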

    Revisiting the Cache Miss Analysis of Multithreaded Algorithms

    Porting Decision Tree Algorithms to Multicore using FastFlow

    The whole computer hardware industry has embraced multicores. On these machines, extreme optimisation of sequential algorithms is no longer sufficient to extract the full machine power, which can only be exploited via thread-level parallelism. Decision tree algorithms exhibit natural concurrency that makes them suitable for parallelisation. This paper presents an approach for easy yet efficient porting of an implementation of the C4.5 algorithm to multicores. The parallel port requires minimal changes to the original sequential code, and achieves up to a 7x speedup on an Intel dual quad-core machine.
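    The natural concurrency in question is that, once a split is chosen, each child partition can be grown independently. The sketch below illustrates this with std::async rather than FastFlow's farm/pipeline patterns, and every name in it (DataSet, best_split, the placeholder halving rule) is invented for the example, not taken from the paper's code.

        #include <future>
        #include <memory>
        #include <vector>

        struct DataSet { std::vector<int> labels; };
        struct Split   { std::vector<DataSet> parts; };

        struct Node {
            Split split;
            std::vector<std::unique_ptr<Node>> children;
        };

        bool is_pure(const DataSet& d) {
            for (int l : d.labels)
                if (l != d.labels.front()) return false;
            return true;   // empty or single-class data
        }

        // Placeholder split: halve the rows. Real C4.5 would pick the
        // attribute test with the highest gain ratio here.
        Split best_split(const DataSet& d) {
            auto mid = d.labels.begin() + d.labels.size() / 2;
            Split s;
            s.parts.push_back({std::vector<int>(d.labels.begin(), mid)});
            s.parts.push_back({std::vector<int>(mid, d.labels.end())});
            return s;
        }

        // Each child partition is grown independently, so subtree
        // construction can be spawned as a task; `depth` doubles as a
        // granularity cutoff so tasks stay coarse enough to pay off.
        std::unique_ptr<Node> grow(const DataSet& d, int depth) {
            auto node = std::make_unique<Node>();
            if (d.labels.size() < 2 || is_pure(d) || depth == 0)
                return node;                               // leaf
            node->split = best_split(d);
            std::vector<std::future<std::unique_ptr<Node>>> futs;
            for (const DataSet& part : node->split.parts)
                futs.push_back(std::async(std::launch::async, grow,
                                          std::cref(part), depth - 1));
            for (auto& f : futs) node->children.push_back(f.get());
            return node;
        }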

    A Multi-threaded Asynchronous Language

    Highly Scalable Multiplication for Distributed Sparse Multivariate Polynomials on Many-core Systems

    We present a highly scalable algorithm for multiplying sparse multivariate polynomials represented in a distributed format. This algorithm targets not only shared-memory multicore computers, but also computer clusters and specialized hardware attached to a host computer, such as graphics processing units or many-core coprocessors. Scalability to large numbers of cores is ensured by the absence of synchronization, locks, and false sharing during the main parallel step.
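    The lock-free shape of such a main step can be sketched as follows. This is a simplified illustration, not the authors' algorithm: the packed-exponent monomial encoding, the cyclic block distribution, and the sequential final merge are all our assumptions. Each worker accumulates into a private map, so the parallel phase takes no locks and writes no shared cache lines.

        #include <cstdint>
        #include <map>
        #include <thread>
        #include <vector>

        struct Term { std::uint64_t mono; long long coef; };  // packed exponents
        using Poly = std::vector<Term>;

        Poly multiply(const Poly& f, const Poly& g, unsigned nworkers) {
            // One private accumulator per worker (assumes nworkers >= 1).
            std::vector<std::map<std::uint64_t, long long>> part(nworkers);
            std::vector<std::thread> pool;
            for (unsigned w = 0; w < nworkers; ++w)
                pool.emplace_back([&, w] {
                    auto& acc = part[w];                   // private: no locks
                    for (std::size_t i = w; i < f.size(); i += nworkers)
                        for (const Term& t : g)
                            // Multiplying packed monomials adds their exponent
                            // fields (assuming no per-variable overflow).
                            acc[f[i].mono + t.mono] += f[i].coef * t.coef;
                });
            for (auto& th : pool) th.join();

            // Merge the partial products; done sequentially for simplicity.
            std::map<std::uint64_t, long long> all;
            for (auto& acc : part)
                for (auto& [m, c] : acc) all[m] += c;
            Poly result;
            for (auto& [m, c] : all)
                if (c != 0) result.push_back({m, c});
            return result;
        }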

    Efficient Data Race Detection for Async-Finish Parallelism

    A major productivity hurdle for parallel programming is the presence of data races. Data races can lead to all kinds of harmful program behaviors, including determinism violations and corrupted memory. However, runtime overheads of current dynamic data race detectors are still prohibitively large (often incurring slowdowns of 10× or more) for use in mainstream software development. In this paper, we present an efficient dynamic race detection algorithm targeting the async-finish task-parallel programming model. The async and finish constructs are at the core of languages such as X10 and Habanero Java (HJ). These constructs generalize the spawn-sync constructs used in Cilk, while still ensuring that all computation graphs are deadlock-free. We have implemented our algorithm in a tool called TaskChecker and evaluated it on a suite of 12 benchmarks. To reduce the overhead of the dynamic analysis, we have also implemented various static optimizations in the tool. Our experimental results indicate that our approach performs well in practice, incurring an average slowdown of 3.05× compared to a serial execution in the optimized case.
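    For flavor, here is a sketch of the general shape of such a detector: serial depth-first execution, a dynamic program structure tree (DPST) of finish/async/step nodes, and per-location shadow state. The may-happen-in-parallel test below follows the DPST characterization from related work on structured parallelism; it is not TaskChecker's actual (more efficient) algorithm, and the unbounded reader list is a deliberate simplification.

        #include <cstdint>
        #include <unordered_map>
        #include <vector>

        enum class Kind { Finish, Async, Step };

        struct Node {
            Kind  kind;
            Node* parent;   // nullptr at the root
            int   depth;
        };

        // For a step `a` that executed before step `b` in the serial
        // depth-first order, the two may run in parallel iff the child of
        // their lowest common ancestor on a's path is an Async node.
        bool may_happen_in_parallel(Node* a, Node* b) {
            while (b->depth > a->depth) b = b->parent;
            while (a->depth > b->depth) a = a->parent;
            if (a == b) return false;            // same step: ordered
            while (a->parent != b->parent) { a = a->parent; b = b->parent; }
            return a->kind == Kind::Async;
        }

        // Shadow state per memory location, updated during the serial replay.
        struct Shadow { Node* last_writer = nullptr; std::vector<Node*> readers; };
        std::unordered_map<std::uintptr_t, Shadow> shadow;

        bool check_write(std::uintptr_t addr, Node* step) {
            Shadow& s = shadow[addr];
            bool race = s.last_writer &&
                        may_happen_in_parallel(s.last_writer, step);
            for (Node* r : s.readers)
                race = race || may_happen_in_parallel(r, step);
            s.last_writer = step;
            s.readers.clear();
            return race;   // write conflicts with a parallel read or write
        }

        bool check_read(std::uintptr_t addr, Node* step) {
            Shadow& s = shadow[addr];
            bool race = s.last_writer &&
                        may_happen_in_parallel(s.last_writer, step);
            s.readers.push_back(step);
            return race;   // read conflicts with a parallel write
        }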

    MetaFork: A Framework for Concurrency Platforms Targeting Multicores

    We present MetaFork, a metalanguage for multithreaded algorithms based on the fork-join concurrency model and targeting multicore architectures. MetaFork is implemented as a source-to-source compilation framework allowing automatic translation of programs from one concurrency platform to another. The current version of this framework supports CilkPlus and OpenMP. We evaluate the benefits of the MetaFork framework through a series of experiments, such as narrowing performance bottlenecks in multithreaded programs. Our experiments also show that, if a native program, written either in CilkPlus or OpenMP, has little parallelism overhead, then the same property holds for its OpenMP or CilkPlus counterpart translated by MetaFork.
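    The flavor of the CilkPlus-to-OpenMP direction can be shown on the classic fib example: spawn/sync maps onto task/taskwait. The two versions are shown side by side for comparison (each half requires the corresponding compiler support); this is an illustration of the mapping, and MetaFork's actual output may differ in detail.

        // CilkPlus input:
        #include <cilk/cilk.h>

        long fib_cilk(long n) {
            if (n < 2) return n;
            long x = cilk_spawn fib_cilk(n - 1);   // child may run in parallel
            long y = fib_cilk(n - 2);
            cilk_sync;                             // wait for the spawned child
            return x + y;
        }

        // A plausible OpenMP translation (must be called from inside a
        // `#pragma omp parallel` / `#pragma omp single` region):
        long fib_omp(long n) {
            if (n < 2) return n;
            long x, y;
            #pragma omp task shared(x)             // spawned task
            x = fib_omp(n - 1);
            y = fib_omp(n - 2);
            #pragma omp taskwait                   // join
            return x + y;
        }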

    Asymmetry-Aware Scheduling in Heterogeneous Multi-core Architectures

    As threads of execution in a multi-programmed computing environment have different characteristics and hardware resource requirements, heterogeneous multi-core processors can achieve higher performance as well as power efficiency than homogeneous multi-core processors. To fully tap into that potential, OS schedulers need to be heterogeneity-aware, so they can match threads to cores according to the characteristics of both. We propose two heterogeneity-aware thread schedulers, PBS and LCSS. PBS makes scheduling decisions based on applications' sensitivity to large cores, and assigns large cores to the applications that can achieve the greatest performance gains from them. LCSS balances the large-core resource among all applications. We have implemented these two schedulers in Linux and evaluated their performance with the PARSEC benchmarks on different heterogeneous architectures. Overall, PBS outperforms the Linux scheduler by 13.3% on average and up to 18%. LCSS achieves a speedup of 5.3% on average and up to 6% over the Linux scheduler. Moreover, PBS delivers good performance with both asymmetric and symmetric workloads, while LCSS is better suited to scheduling symmetric workloads. In summary, PBS and LCSS provide repeatability of performance measurement and better performance than the Linux OS scheduler.
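    A toy user-level illustration of the PBS idea (invented names and numbers, not the authors' kernel code): measure each application's large-core speedup and hand the scarce large cores to the biggest beneficiaries.

        #include <algorithm>
        #include <cstddef>
        #include <iostream>
        #include <string>
        #include <vector>

        struct App { std::string name; double large_core_speedup; };

        // Rank applications by their measured speedup on a large core and
        // keep the top n_large; the rest stay on small cores.
        std::vector<App> assign_large_cores(std::vector<App> apps,
                                            std::size_t n_large) {
            std::sort(apps.begin(), apps.end(),
                      [](const App& a, const App& b) {
                          return a.large_core_speedup > b.large_core_speedup;
                      });
            apps.resize(std::min(n_large, apps.size()));
            return apps;
        }

        int main() {
            // Hypothetical sensitivities for four PARSEC-style workloads.
            std::vector<App> apps = {
                {"streamcluster", 1.8}, {"blackscholes", 1.2},
                {"x264", 1.6},          {"canneal", 1.1},
            };
            for (const App& a : assign_large_cores(apps, 2))
                std::cout << a.name << " -> large core\n";
        }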